
Implementation of max_batches parameter. #1087

Merged

merged 7 commits into main from 1005-add-max-batches-to-automl-search on Aug 21, 2020

Conversation

freddyaboulton (Contributor):

Pull Request Description

Fixes #1005 by adding a max_batches parameter to AutoMLSearch. When both max_pipelines and max_batches are set, max_pipelines will take precedence.
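For context, a minimal usage sketch of the new parameter (the dataset loader and the batch count are illustrative, and later commits in this PR rename the parameter to _max_batches):

    from evalml import AutoMLSearch
    from evalml.demos import load_breast_cancer  # any binary classification dataset works here

    X, y = load_breast_cancer()

    # Stop after three batches of pipelines; max_pipelines is unset, so
    # max_batches alone bounds the search.
    automl = AutoMLSearch(problem_type="binary", max_batches=3)
    automl.search(X, y)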



codecov bot commented Aug 20, 2020

Codecov Report

Merging #1087 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main    #1087   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files         192      192           
  Lines       10853    10894   +41     
=======================================
+ Hits        10842    10883   +41     
  Misses         11       11           
Impacted Files                              Coverage Δ
evalml/automl/automl_search.py              99.56% <100.00%> (+<0.01%) ⬆️
evalml/tests/automl_tests/test_automl.py    100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -76,7 +76,8 @@ def __init__(self,
                 n_jobs=-1,
                 tuner_class=None,
                 verbose=True,
-                optimize_thresholds=False):
+                optimize_thresholds=False,
+                max_batches=None):
freddyaboulton (Contributor Author):

Not adding to docstring because we want to keep this "hidden" for now.

Contributor:

Yep, thanks. Will make it easier to make breaking changes when we next update the automl search UX.

Contributor:

Hm, after thinking on it for a bit, my preference would be to name this _max_batches and add a docstring entry, rather than simply omitting it from the docstring. Not a blocker.

Contributor:

I agree with this, adding the docstring now will also help us with development and keep track of what this is for!

freddyaboulton (Contributor Author):

I'll rename it to _max_batches and add it to the docstring!

dsherry (Contributor) left a comment:

Awesome! This is going to make the perf testing a lot easier :)

I left a comment on making the parameter private as opposed to unlisted (feel free to push back if you disagree), and some minor test suggestions.


@@ -166,7 +167,7 @@ def __init__(self,
             raise TypeError("max_time must be a float, int, or string. Received a {}.".format(type(max_time)))

         self.max_pipelines = max_pipelines
-        if self.max_pipelines is None and self.max_time is None:
+        if self.max_pipelines is None and self.max_time is None and max_batches is None:
Contributor:

👍

self._max_batches = max_batches
# This is the default value for IterativeAlgorithm - setting this explicitly makes sure that
# the behavior of max_batches does not break if IterativeAlgorithm is changed.
self._pipelines_per_batch = 5
Contributor:

Ah got it, thanks.

@@ -365,6 +373,8 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl

         if self.allowed_pipelines == []:
             raise ValueError("No allowed pipelines to search")
+        if self._max_batches and self.max_pipelines is None:
+            self.max_pipelines = 1 + len(self.allowed_pipelines) + (self._pipelines_per_batch * (self._max_batches - 1))
Contributor:

Makes sense. When we change our automl algorithm we'll have to remember to update this. I spent some time considering if there's a way to have IterativeAlgorithm compute this for us, but I think this is fine.

Contributor:

Can I clarify this calculation so I better understand? Is this 1 (baseline) + len(self.allowed_pipelines) for first batch + (self._pipelines_per_batch * (self._max_batches - 1)) for each batch thereafter?

freddyaboulton (Contributor Author):

Yes you got it @angela97lin !

Contributor:

ah gotcha, thanks @freddyaboulton! 😊
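To make the arithmetic concrete, a worked example (the pipeline count here is an assumption for illustration):

    # Hypothetical numbers: 6 allowed pipelines, 5 pipelines per batch
    # after the first, max_batches=3.
    allowed_pipelines = 6
    pipelines_per_batch = 5
    max_batches = 3

    # 1 baseline, plus one pipeline per allowed class in the first batch,
    # plus full batches thereafter.
    max_pipelines = 1 + allowed_pipelines + pipelines_per_batch * (max_batches - 1)
    print(max_pipelines)  # 1 + 6 + 5 * 2 = 17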

@@ -377,7 +387,8 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
                 tuner_class=self.tuner_class,
                 random_state=self.random_state,
                 n_jobs=self.n_jobs,
-                number_features=X.shape[1]
+                number_features=X.shape[1],
+                pipelines_per_batch=self._pipelines_per_batch
Contributor:

👍

n_results = 1 + len(automl.allowed_pipelines) + (5 * (max_batches - 1))

assert automl._automl_algorithm._batch_number == max_batches
assert len(automl._results["pipeline_results"]) == n_results
Contributor:

Could you check automl.rankings and automl.full_rankings? Also please use the public accessor for results, i.e. automl.results

freddyaboulton (Contributor Author):

Good call!
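A sketch of what the suggested checks might look like (the rankings behavior is assumed here: full_rankings lists every evaluated pipeline including the baseline, while rankings keeps only the best result per pipeline class):

    # Public accessor instead of automl._results:
    assert len(automl.results["pipeline_results"]) == n_results
    assert len(automl.full_rankings) == n_results
    # rankings is deduplicated per pipeline class, so its length is bounded
    # by the number of allowed pipelines plus the baseline.
    assert len(automl.rankings) <= 1 + len(automl.allowed_pipelines)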

# So that the test does not break when new estimator classes are added
n_results = 1 + len(automl.allowed_pipelines) + (5 * (max_batches - 1))

assert automl._automl_algorithm._batch_number == max_batches
Contributor:

Yeah we don't surface this publicly right now. You're reminding me we should eventually pin down an API for exposing this sort of information.

Please do:

assert automl._automl_algorithm.batch_number == max_batches
assert automl._automl_algorithm.pipeline_number == n_results

freddyaboulton (Contributor Author):

Thanks for the tips! I was looking at IterativeAlgorithm instead of the base class.
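For reference, a sketch of what the base-class accessors referenced above presumably look like (names follow the suggestion; the actual evalml implementation may differ):

    class AutoMLAlgorithm:
        def __init__(self):
            self._batch_number = 0
            self._pipeline_number = 0

        @property
        def batch_number(self):
            # Number of batches the algorithm has produced so far.
            return self._batch_number

        @property
        def pipeline_number(self):
            # Number of pipelines the algorithm has produced so far.
            return self._pipeline_number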


@patch('evalml.pipelines.BinaryClassificationPipeline.score', return_value={"Log Loss Binary": 0.8})
@patch('evalml.pipelines.BinaryClassificationPipeline.fit')
def test_max_batches_plays_nice_with_other_stopping_criteria(mock_fit, mock_score, X_y_binary):
Contributor:

Haha nice

def test_max_batches_must_be_non_negative(max_batches):
    with pytest.raises(ValueError, match=f"Parameter max batches must be None or non-negative. Received {max_batches}."):
        AutoMLSearch(problem_type="binary", max_batches=max_batches)
Contributor:

Wonderful!
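For reference, a self-contained version of this test might look like the following (the parametrized values and imports are assumptions; the diff hunk elides them):

    import pytest
    from evalml import AutoMLSearch

    # Hypothetical parametrization; the PR's actual values are not shown in this hunk.
    @pytest.mark.parametrize("max_batches", [-1, -10])
    def test_max_batches_must_be_non_negative(max_batches):
        with pytest.raises(ValueError, match=f"Parameter max batches must be None or non-negative. Received {max_batches}."):
            AutoMLSearch(problem_type="binary", max_batches=max_batches)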


angela97lin (Contributor) left a comment:

LGTM! Was curious about implementation for my own understanding but the impl looks great, tests are fantastic :D



freddyaboulton force-pushed the 1005-add-max-batches-to-automl-search branch from 1ad0771 to 823714f on August 21, 2020 20:03
freddyaboulton (Contributor Author):

@dsherry I addressed your comments! Thanks for the feedback.

@@ -76,7 +76,8 @@ def __init__(self,
                 n_jobs=-1,
                 tuner_class=None,
                 verbose=True,
-                optimize_thresholds=False):
+                optimize_thresholds=False,
+                _max_batches=None):
Contributor:

Thanks, I think this is a good pattern for us to avoid breaking changes but still communicate clearly to users 👍

@@ -129,6 +130,8 @@ def __init__(self,
             None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.

         verbose (boolean): If True, turn verbosity on. Defaults to True

+        _max_batches (int): The maximum number of batches of pipelines to search.
Contributor:

Great. Could also mention that max_pipelines and max_time take precedence, though that's not required.

@@ -129,6 +130,9 @@ def __init__(self,
             None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.

         verbose (boolean): If True, turn verbosity on. Defaults to True

+        _max_batches (int): The maximum number of batches of pipelines to search. Parameters max_time and
+            max_pipelines take precedence over _max_batches in stopping the search.
Contributor:

Thanks!

dsherry (Contributor) left a comment:

🚢 💨

freddyaboulton merged commit 6cd61ff into main on Aug 21, 2020
freddyaboulton deleted the 1005-add-max-batches-to-automl-search branch on August 21, 2020 22:04
dsherry mentioned this pull request on Aug 25, 2020
Linked issue (#1005): Add max_batches parameter to automl search object